Chinese Informal Word Normalization: an Experimental Study

Authors

  • Aobo Wang
  • Min-Yen Kan
  • Daniel Andrade
  • Takashi Onishi
  • Kai Ishikawa
Abstract

We study the linguistic phenomenon of informal words in Chinese microtext and present a novel method for normalizing Chinese informal words to their formal equivalents. We formalize the task as a classification problem and propose rule-based and statistical features to model three plausible channels that explain the connection between formal and informal pairs. Our two-stage selection-classification model is evaluated on a crowdsourced corpus and achieves a normalization precision of 89.5% across the different channels, a significant improvement over the state of the art.

Similar resources

A Beam-Search Decoder for Normalization of Social Media Text with Application to Machine Translation

Social media texts are written in an informal style, which hinders other natural language processing (NLP) applications such as machine translation. Text normalization is thus important for processing of social media text. Previous work mostly focused on normalizing words by replacing an informal word with its formal form. In this paper, to further improve other downstream NLP applications, we ...

Adaptive Parser-Centric Text Normalization

Text normalization is an important first step towards enabling many Natural Language Processing (NLP) tasks over informal text. While many of these tasks, such as parsing, perform the best over fully grammatically correct text, most existing text normalization approaches narrowly define the task in the word-to-word sense; that is, the task is seen as that of mapping all out-of-vocabulary non-st...

Mining Informal Language from Chinese Microtext: Joint Word Recognition and Segmentation

We address the problem of informal word recognition in Chinese microblogs. A key problem is the lack of word delimiters in Chinese. We exploit this reliance as an opportunity: recognizing the relation between informal word recognition and Chinese word segmentation, we propose to model the two tasks jointly. Our joint inference method significantly outperforms baseline systems that conduct the t...

Segmenting Chinese Microtext: Joint Informal-Word Detection and Segmentation with Neural Networks

State-of-the-art Chinese word segmentation systems typically exploit supervised models trained on a standard manually-annotated corpus, achieving performances over 95% on a similar standard testing corpus. However, the performances may drop significantly when the same models are applied onto Chinese microtext. One major challenge is the issue of informal words in the microtext. Previous studies...

Improving Text Normalization by Optimizing Nearest Neighbor Matching

Text normalization is an essential task in the processing and analysis of social media, which is dominated by informal writing. It aims to map informal words to their intended standard forms. In this paper, we present an automatic optimization-based nearest neighbor matching approach for text normalization. This approach is motivated by the observation that text normalization is essentially a m...

Publication date: 2013